Abstract

We propose a method combining relational-logic representations with neural network learning.
A general lifted architecture, possibly reflecting some background domain knowledge,
is described through relational rules which may be handcrafted or learned. The relational
rule-set serves as a template for unfolding possibly deep neural networks whose structures
also reflect the structures of given training or testing relational examples. Different
networks corresponding to different examples share their weights, which co-evolve during
training by stochastic gradient descent. The framework allows for hierarchical
relational modeling constructs and learning of latent relational concepts through shared
hidden-layer weights corresponding to the rules. Discovery of notable relational concepts
and experiments on 78 relational learning benchmarks demonstrate favorable performance
of the method.

Keywords: Relational learning, Lifted models, Neural networks 

1. Introduction 

Lifted models, also known as templated models, have attracted significant attention recently
(Kimmig et al., 2015) in areas such as statistical relational learning. Lifted models define
patterns from which specific (ground) models can be unfolded. For example, a lifted Markov
network model (Richardson and Domingos, 2006) may express that friends of smokers tend
to be smokers, and such a pattern then constrains the probabilistic relationships in all sets of
vertices corresponding to particular friends-smokers in the derived ground Markov network.
The lifted patterns are typically encoded in relational logic-based languages.

Here we contribute a method for (deep) lifted feed-forward neural network learning, in 
which the ground network structure is unfolded from a set of weighted rules in relational 
logic. The relational rules are instantly interpretable and can be handcrafted by a domain 
expert or learned, e.g. through techniques of inductive logic programming (De Raedt, 2008). 
Weights of the ground neural networks are determined by the weighted relational rules and
can be learned by stochastic gradient descent. This means that weights between
different ground neurons constructed from the same relational rule are tied in our framework, 
similarly to how weights are shared in lifted graphical models in statistical relational learning 
or how weights are tied together by application of filters in convolutional neural networks 
in deep learning. 

A salient property of our approach distinguishing it from previous studies on adapting 
neural networks for relational learning is that the ground network structure depends not 
only on the relational rule set but also on a particular example, i.e., different networks are 
constructed for different examples to exploit their particular relational properties. However, 
the different networks share their weights as these are all bound to the relational rules, and 
so weight-updates performed for one training example are reflected in networks produced 
for other examples, which allows the model to learn directly from relational data. 

The main advantage of the presented approach is that it can effectively learn weights
of latent relational structures. This is a difficult task for existing lifted systems based on
probabilistic inference because there one typically needs to run expensive expectation
maximization algorithms in order to learn parameters when latent structures are present. On the
other hand, deep neural networks, which we exploit in our work, have been shown to
effectively learn latent structures, although only in the ground non-relational setting.
By combining relational logic with deep neural networks, we obtain a framework flexible
enough to learn weights of latent relational structures, which we also verify experimentally.
While there have been several works combining propositional or relational logic with neural
networks (Towell et al., 1990; Botta et al., 1997; França et al., 2014), none of the existing
methods is able to learn weights of latent non-ground relational structures (see footnote 1).

The rest of the paper is organized as follows. The next section briefly summarizes the
preliminaries regarding relational logic and the assumed neural network paradigm. Section
3 explains the principles of the proposed Lifted Relational Neural Networks method. Section
4 describes useful modeling constructs. In Section 5, we show how weight learning is
implemented. Section 6 places the presented method in the context of existing work.
In Section 7, we subject the method to comparative experimental evaluation on relational
learning benchmarks and then conclude the paper.

2. Preliminaries 

A first-order logic theory is a set of formulas formed from constants, variables, functions,
and predicates (Smullyan, 1995). Constant symbols represent objects in the domain of
interest (e.g. alice) and will be written in lower-case. Variables (e.g. Person) range over the
objects in the domain and will be written with a capitalized first letter. Function symbols
will not be used in this paper. Predicate symbols represent relations among objects in the
domain or their attributes. A term may be a constant or variable (or a function symbol
applied to a tuple of terms). An atom is a predicate symbol applied to a tuple of terms
(e.g. friends(X, bob)). Formulas are constructed from atoms using logical connectives and
quantifiers. A ground term is a term containing no variables. A ground atom is an atom
having only ground terms as arguments (e.g. friends(alice, bob)). A literal is an atom or
a negation of an atom (which is also called a negative literal). A clause is a universally
quantified disjunction of literals. When there is no risk of confusion, we will not write the
universal quantifiers explicitly. A clause with exactly one positive literal is a definite clause.
A definite clause with no negative literals (i.e. consisting of just one literal) is called a fact.
A definite clause $h \lor \neg b_1 \lor \dots \lor \neg b_k$ can also be written as an implication
$h \leftarrow b_1 \land \dots \land b_k$. The literal $h$ is then called the head and the conjunction
$b_1 \land \dots \land b_k$ is called the body. We will sometimes call definite clauses which are
not facts rules.

1. What we mean by latent relational structures will be better explained in Section 4 where we present
several types of latent structures which can be used in our framework.

Given a first-order logic theory, the set of all ground atoms which can be constructed
using the constants, function symbols and predicates present in the theory is its Herbrand
base. A Herbrand interpretation, also called a possible world, assigns a truth value to each
possible ground atom from a given Herbrand base. A set of formulas is satisfiable if there
exists at least one world in which all formulas from the set are true; such a world is its
Herbrand model. A satisfiable set of definite clauses has a least Herbrand model and this model
is unique. The least Herbrand model of a function-free set of definite clauses (i.e. a Datalog
theory) can be constructed in a finite number of steps using the immediate-consequence
operator (Van Emden and Kowalski, 1976). The immediate-consequence operator $T_P$ maps the space
of Herbrand interpretations over some Herbrand base $B$ back to itself,
$T_P : \mathcal{I}(B) \to \mathcal{I}(B)$. The mapping of $T_P$ is directly prescribed by the
theory $\mathcal{P}$ such that for $I \in \mathcal{I}(B)$,
$T_P(I) = \{h \mid (h \leftarrow b_1 \land \dots \land b_k) \in \mathcal{P} \text{ and } \{b_1, \dots, b_k\} \subseteq I\}$.
In other words, the operator $T_P$ expands the current set of true atoms (interpretation $I$)
with their immediate consequences as prescribed by the rules in $\mathcal{P}$.
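
The fixpoint construction described above can be illustrated with a minimal Python sketch. It is restricted to ground definite clauses for brevity, and the clause encoding, the function names and the tiny example program are assumptions made for this illustration only; they are not part of the original method.

```python
from typing import List, Set, Tuple

# A ground definite clause is a pair (head, body); a fact has an empty body.
Clause = Tuple[str, Tuple[str, ...]]


def immediate_consequences(program: List[Clause], interpretation: Set[str]) -> Set[str]:
    """One application of T_P: heads of all clauses whose body atoms are all true."""
    return {head for head, body in program
            if all(atom in interpretation for atom in body)}


def least_herbrand_model(program: List[Clause]) -> Set[str]:
    """Iterate T_P from the empty interpretation until a fixpoint is reached."""
    model: Set[str] = set()
    while True:
        new_model = model | immediate_consequences(program, model)
        if new_model == model:
            return model
        model = new_model


# Hypothetical ground program used only for illustration.
program = [
    ("parent(bob,alice)", ()),
    ("female(alice)", ()),
    ("mother(bob,alice)", ("parent(bob,alice)", "female(alice)")),
    ("father(bob,alice)", ("parent(bob,alice)", "male(alice)")),
]
model = least_herbrand_model(program)
print(sorted(model))
```

Because the program is finite and function-free, every iteration either adds at least one atom or stops, so the loop terminates after finitely many steps.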

An artificial neural network (NN) is a biologically inspired mathematical model,
consisting of interconnected processing units called neurons, each of which is associated with
an activation function $g_i \in \mathcal{G}$ from some predefined family of differentiable functions.
A neural network then defines a mapping $f : \mathbb{R}^m \to \mathbb{R}^n$ of input space to target
space vectors, parameterized by a set of weights $w_j^i \in \mathbb{R}$. Following the pattern of
neural interconnections, the mapping $f$ can be seen as a composition of activation functions
$g_i \in \mathcal{G}$. For feed-forward neural networks it is typically a hierarchical compound of
non-linear weighted sums $g_i(\sum_j w_j^i \, g_j(\sum_k w_k^j \, g_k(\cdots)))$, which can be
conveniently depicted as a weighted directed acyclic graph of neurons (e.g. Fig. 3). By adapting
the weights $w_j^i \in W$ the model can be learned to approximate some target function
$t : \mathbb{R}^m \to \mathbb{R}^n$. This is typically performed by some sort of gradient descent
minimization of a given cost function $cost : \{W, \mathcal{D}\} \to \mathbb{R}$ capturing the
discrepancy between $f$ and $t$ upon some set of training samples $(x_d, t(x_d)) \in \mathcal{D}$.

3. Lifted Relational Neural Networks 

A lifted relational neural network (LRNN) $\mathcal{N}$ is a set of weighted definite clauses, i.e. pairs
$(R_i, w_i)$ where $R_i$ is a function-free definite clause and $w_i$ is a real number. When
$\mathcal{N}$ is a set of weighted definite clauses, $\mathcal{N}^*$ will denote the corresponding set
of the definite clauses without weights, i.e. $\mathcal{N}^* = \{C : (C, w) \in \mathcal{N}\}$. The set
$\mathcal{N}$ must satisfy the following non-recursiveness requirement (see footnote 2): there must
exist a strict ordering $\prec$ of predicates such that if there is a rule with a predicate $p_1$ in
the head and a predicate $p_2$ in the body then $p_1 \prec p_2$.
Given a LRNN $\mathcal{N}$, let $\mathcal{H}$ be the least Herbrand model of $\mathcal{N}^*$. We define
the grounding of the LRNN $\mathcal{N}$ as
$\overline{\mathcal{N}} = \{(h\theta \leftarrow b_1\theta \land \dots \land b_k\theta, w) : (h \leftarrow b_1 \land \dots \land b_k, w) \in \mathcal{N} \text{ and } \{h\theta, b_1\theta, \dots, b_k\theta\} \subseteq \mathcal{H}\}$.
That is, $\overline{\mathcal{N}}$ is defined as the set of ground definite clauses
which can be obtained by grounding rules from the LRNN and which are active in the least
Herbrand model of $\mathcal{N}^*$ (a rule is active in $\mathcal{H}$ if its body is true in $\mathcal{H}$).
As already outlined in the Introduction, LRNNs are templates for creating ground neural networks.
The requirement that ground rules should be active in $\mathcal{H}$ is beneficial in practice because
it provides us with flexibility in controlling the complexity of the constructed neural networks.

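Example 1 Let

$$\begin{aligned}
\mathcal{N} = \{ & (\mathit{mother}(C, M) \leftarrow \mathit{parent}(C, M) \land \mathit{female}(M),\ 1), \\
 & (\mathit{father}(C, F) \leftarrow \mathit{parent}(C, F) \land \mathit{male}(F),\ 2), \\
 & (\mathit{female}(alice),\ 1),\ (\mathit{parent}(bob, alice),\ 1),\ (\mathit{parent}(eve, alice),\ 1)\}.
\end{aligned}$$

Then for its grounding we have

$$\begin{aligned}
\overline{\mathcal{N}} = \{ & (\mathit{mother}(bob, alice) \leftarrow \mathit{parent}(bob, alice) \land \mathit{female}(alice),\ 1), \\
 & (\mathit{mother}(eve, alice) \leftarrow \mathit{parent}(eve, alice) \land \mathit{female}(alice),\ 1), \\
 & (\mathit{female}(alice),\ 1),\ (\mathit{parent}(bob, alice),\ 1),\ (\mathit{parent}(eve, alice),\ 1)\}.
\end{aligned}$$

Notice that $\overline{\mathcal{N}}$ does not contain the predicates male/1 or father/2 as there are no
ground atoms based on them in the least Herbrand model of $\mathcal{N}^*$.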
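
The grounding of Example 1 can be reproduced with a short Python sketch. The tuple encoding of atoms, the convention that capitalized strings are variables, and all function names are assumptions of this illustration, not the authors' implementation; the sketch also assumes that every head variable occurs in the rule body.

```python
def is_var(term):
    """Variables are written with a capitalized first letter."""
    return term[0].isupper()


def match(atom, fact, theta):
    """Extend substitution theta so that atom equals the ground fact, else None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    theta = dict(theta)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if theta.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return theta


def body_substitutions(body, model):
    """All substitutions under which every body atom is true in the model."""
    thetas = [{}]
    for atom in body:
        extended = []
        for theta in thetas:
            for fact in model:
                new_theta = match(atom, fact, theta)
                if new_theta is not None:
                    extended.append(new_theta)
        thetas = extended
    return thetas


def substitute(atom, theta):
    return (atom[0],) + tuple(theta.get(t, t) for t in atom[1:])


def ground(lrnn):
    """Return the least Herbrand model and the active ground weighted clauses."""
    model = set()
    changed = True
    while changed:  # naive fixpoint iteration of the immediate-consequence operator
        changed = False
        for head, body, _ in lrnn:
            for theta in body_substitutions(body, model):
                atom = substitute(head, theta)
                if atom not in model:
                    model.add(atom)
                    changed = True
    grounding = []
    for head, body, weight in lrnn:
        for theta in body_substitutions(body, model):
            grounding.append((substitute(head, theta),
                              tuple(substitute(b, theta) for b in body),
                              weight))
    return model, grounding


# The LRNN from Example 1 as (head, body, weight) triples.
lrnn = [
    (("mother", "C", "M"), (("parent", "C", "M"), ("female", "M")), 1.0),
    (("father", "C", "F"), (("parent", "C", "F"), ("male", "F")), 2.0),
    (("female", "alice"), (), 1.0),
    (("parent", "bob", "alice"), (), 1.0),
    (("parent", "eve", "alice"), (), 1.0),
]
model, grounding = ground(lrnn)
for clause in grounding:
    print(clause)
```

The father rule produces no ground clause because no substitution makes its body true in the least Herbrand model, which is the effect noted at the end of the example.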

Definition 1 Let $\mathcal{N}$ be a LRNN, and let $\overline{\mathcal{N}}$ be its grounding. Let
$g_\lor$, $g_\land$ and $g^*_\land$ be families of multivariate functions with exactly one function
for each number of arguments. The ground neural network of $\mathcal{N}$ is a feedforward neural
network constructed as follows.

• For every ground atom $h$ occurring in $\overline{\mathcal{N}}$, there is a neuron $A_h$, called atom neuron.
The activation functions of atom neurons are from the family $g_\lor$.

• For every ground fact $(h, w) \in \overline{\mathcal{N}}$, there is a neuron $F_{(h,w)}$, called fact neuron, which
has no input and always outputs a constant value.

• For every ground rule $(h\theta \leftarrow b_1\theta \land \dots \land b_k\theta, w) \in \overline{\mathcal{N}}$, there is a neuron
$R_{h\theta \leftarrow b_1\theta \land \dots \land b_k\theta}$, called rule neuron. It has the atom neurons
$A_{b_1\theta}, \dots, A_{b_k\theta}$ as inputs, all with weight 1. The activation functions of rule
neurons are from the family $g_\land$.

• For every rule $(h \leftarrow b_1 \land \dots \land b_k, w) \in \mathcal{N}$ and every $h\theta \in \mathcal{H}$, where
$\mathcal{H}$ is the least Herbrand model of $\mathcal{N}^*$, there is a neuron
$Agg_{(h\theta \leftarrow b_1 \land \dots \land b_k, w)}$, called aggregation neuron. Its inputs are all rule neurons
$R_{h\theta' \leftarrow b_1\theta' \land \dots \land b_k\theta'}$ where $h\theta = h\theta'$, with all weights equal to 1.
The activation functions of the aggregation neurons are from the family $g^*_\land$.

• Inputs of an atom neuron $A_{h\theta}$ are the aggregation neurons $Agg_{(h\theta \leftarrow b_1 \land \dots \land b_k, w)}$
and fact neurons $F_{(h\theta, w)}$. The weights of the input neurons are the respective $w$'s.

2. The reason why we do not allow recursion will be clearer when we explain weight learning in
Section 5. Here, we just note that whereas rule sets without recursion will lead to optimization problems
solvable by an algorithm which is basically a modified back-propagation algorithm, rule sets with
recursion would lead to more complicated optimization problems which would not directly allow us to exploit
existing results on training feedforward neural networks.

Example 2 Let us consider the following LRNN

$$\begin{aligned}
\mathcal{N} = \{ & (\mathit{foal}(A) \leftarrow \mathit{parent}(A, P) \land \mathit{horse}(P),\ w_m),\ (\mathit{foal}(A) \leftarrow \mathit{sibling}(A, S) \land \mathit{horse}(S),\ w_n), \\
 & (\mathit{horse}(dakotta),\ w_1),\ (\mathit{horse}(cheyenne),\ w_2),\ (\mathit{horse}(aida),\ w_3), \\
 & (\mathit{parent}(star, aida),\ w_6),\ (\mathit{parent}(star, cheyenne),\ w_5),\ (\mathit{sibling}(star, dakotta),\ w_4)\}.
\end{aligned}$$

The LRNN $\mathcal{N}$ and its ground neural network are shown in Fig. 1.

[Figure 1 layer labels: rule bodies; fact neurons; atom neurons]

Figure 1: Depiction of the rule-based template (left) of LRNN $\mathcal{N}$ from Ex. 2, and its
corresponding ground neural network $\overline{\mathcal{N}}$ (right), with colors denoting the predicate
signatures, rectangular nodes corresponding to ground and circular to lifted
literals, respectively.


What distinguishes LRNNs from ordinary neural networks the most is the following
property. Having a pre-trained LRNN $\mathcal{N}$ described by some general rules, we can extend
it with a description of a particular case to obtain a ground neural network and then use the
latter for prediction. This is similar in spirit to lifted graphical models.

Example 3 For instance, $\mathcal{N}$ may describe general rules for explosiveness of molecules (e.g.
represented by a predicate explosive) and $\mathcal{M}_1$ and $\mathcal{M}_2$ may be sets of (weighted) facts
describing two particular molecules. Then to use the LRNN $\mathcal{N}$ for predicting whether $\mathcal{M}_1$
and $\mathcal{M}_2$ are explosive, we can simply construct ground NNs of $\mathcal{N} \cup \mathcal{M}_1$ and
$\mathcal{N} \cup \mathcal{M}_2$, and compute the output of the respective atom neurons
$explosive_1 \in \overline{\mathcal{N} \cup \mathcal{M}_1}$ and $explosive_2 \in \overline{\mathcal{N} \cup \mathcal{M}_2}$.
As a distinctive feature of lifted models, the two ground LRNNs for the two example
molecules may have very different size and structure because the least Herbrand models
of $\mathcal{N}^* \cup \mathcal{M}_1^*$ and of $\mathcal{N}^* \cup \mathcal{M}_2^*$, which determine the structures of the
ground LRNNs, may be very different (because the structure and the size of the molecules described
by $\mathcal{M}_1$ and $\mathcal{M}_2$ are different). An illustration of this effect, for two example
molecules and a template $\mathcal{N}$ from Fig. 2, is displayed in Fig. 3.

[Figure 2, left panel (partial): a molecule described by the ground facts H(h1), H(h2), bond(h1, h2), bond(h2, h1)]

Figure 2: Two example molecules (left), described by surrounding sets of ground facts
$\mathcal{M}_1$ and $\mathcal{M}_2$, are being merged with the lifted LRNN $\mathcal{N}$, composed of general
weighted rules loosely pointing to explosiveness of molecules (right), to form two
ground networks displayed in Fig. 3. The rules in $\mathcal{N}$ provide adaptive means
to create latent groups ($gr_i$) of atom types (O ... H) that, through a bond
predicate ($b(A, B)$) connecting couples of atoms, form relational features (e.g.
$f_1(A, B) \leftarrow gr_1(A) \land bond(A, B) \land gr_2(B)$), which set the basis for the final
explosiveness output. For the sake of space we assume a single relational (graphlet)
feature $f_1$ only.

Depending on the families of activation functions $g_\lor$, $g_\land$ and $g^*_\land$ used, we can obtain
neural networks with different behavior. For intuitiveness, in order for rules
$(h \leftarrow b_1 \land \dots \land b_k, w)$ to behave similarly to "if-then" rules, we should prefer the
outputs of rule neurons to be high (e.g. close to 1) if and only if all the inputs from the atom
neurons corresponding to the literals from the body of the rule have high outputs. Similarly, we
should prefer the output of the atom neurons, which should intuitively behave similarly to
disjunction, to be high if and only if at least one of the rule neurons or fact neurons, which are
inputs for the given atom neuron, has high output. Logical operators from various fuzzy logics
(Klir and Yuan, 1995) may serve as an inspiration for selecting suitable activation functions.

Example 4 In Gödel fuzzy logic, conjunction $b_1 \land \dots \land b_k$, where $b_i$ are fuzzy logic literals,
is given as $\min_i b_i$ and disjunction $b_1 \lor \dots \lor b_k$ is given as $\max_i b_i$. To emulate reasoning
in Gödel logic, we could simply set $g_\land(b_1, \dots, b_k) = \min_i b_i$,
$g^*_\land(b_1, \dots, b_m) = \max_i b_i$, and $g_\lor(b_1, \dots, b_m) = \max_i b_i$. Here, the output of any
rule neuron $R_{h \leftarrow b_1 \land \dots \land b_k}$ is the minimum value which makes the fuzzy truth value
of the implication $h \leftarrow b_1 \land \dots \land b_k$ equal to 1 in the Gödel fuzzy logic. Likewise, the
output of any aggregation neuron is the minimum value which makes the fuzzy truth value of all the
respective ground implications equal to 1 simultaneously. This way, LRNNs can emulate fuzzy logic
programming.

Figure 3: Two groundings $\overline{\mathcal{N} \cup \mathcal{M}_1}$ and $\overline{\mathcal{N} \cup \mathcal{M}_2}$ formed by merging the two example
molecules with the LRNN $\mathcal{N}$ from Fig. 2. The shared predicate signatures and
weights tied by the template are denoted by colors. For the sake of space we
display only ground rule sets instead of complete ground networks (i.e., fact and
aggregation neurons are omitted); Fig. 1 illustrates the (direct) correspondence
of such a set to a full ground neural network.


Next, we introduce two particular collections of activation functions inspired by fuzzy 
logic which will be used in the experiments (note that the activation functions shown in the 
above example would not be very suitable for gradient-based learning). 

Definition 2 (Max-Sigmoid Activation Functions) The Max-Sigmoid (MS) collection
of activation functions is composed of the following three families of functions:
$g_\land(b_1, \dots, b_k) = sigm\left(\sum_{i=1}^{k} b_i - k + b_0\right)$,
$g^*_\land(b_1, \dots, b_m) = \max_i b_i$, and
$g_\lor(b_1, \dots, b_k) = sigm\left(\sum_{i=1}^{k} b_i + b_0\right)$.

The rationale for this family of activation functions is as follows. As already mentioned,
the activation function $g_\land$ should have high output if and only if all its inputs are high.
To achieve this, we can crudely approximate the Łukasiewicz fuzzy conjunction, which is given
as $\max\{0, b_1 + \dots + b_k - k + 1\}$, by the function $sigm(b_1 + \dots + b_k - k + b_0)$. A plot of
the function $sigm(b_1 + \dots + b_k - k + 1)$ is shown, for $k = 2$, in the left panel of Fig. 4.
The activation function $g^*_\land$ outputs the value equal to the highest of its inputs. Example 5
illustrates that this can be seen as finding the best "match" of a pattern (rule). The
activation function $g_\lor$ should have high output if at least one of the inputs is high or if all
inputs are somewhat high. To satisfy this, we can crudely approximate the Łukasiewicz fuzzy
disjunction, which is given as $\min\{1, b_1 + \dots + b_k\}$, by the function
$sigm(b_1 + \dots + b_k + b_0)$. A plot of the function $sigm(b_1 + \dots + b_k + 0)$ is shown in the
right panel of Fig. 4. Example 6 illustrates the intuition for the activation function $g_\lor$.

Figure 4: A crude approximation of Łukasiewicz conjunction (left) and disjunction (right)
by respective sigmoidal activation functions for use in LRNNs.
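
To make the comparison with the Łukasiewicz operators concrete, the following Python sketch evaluates both at the corners of the unit square. The function names are hypothetical, and the offsets $b_0 = 1$ (conjunction) and $b_0 = 0$ (disjunction) are taken from the plots described for Fig. 4 rather than from any learned model.

```python
import math


def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def lukasiewicz_and(bs):
    return max(0.0, sum(bs) - len(bs) + 1.0)


def lukasiewicz_or(bs):
    return min(1.0, sum(bs))


def ms_rule(bs, b0=1.0):
    """g_and of the Max-Sigmoid collection: sigm(sum(b_i) - k + b0)."""
    return sigm(sum(bs) - len(bs) + b0)


def ms_aggregate(bs):
    """g*_and of the Max-Sigmoid collection: the highest input."""
    return max(bs)


def ms_atom(bs, b0=0.0):
    """g_or of the Max-Sigmoid collection: sigm(sum(b_i) + b0)."""
    return sigm(sum(bs) + b0)


for inputs in ([0.0, 0.0], [1.0, 0.0], [1.0, 1.0]):
    print(inputs,
          "and:", lukasiewicz_and(inputs), round(ms_rule(inputs), 3),
          "or:", lukasiewicz_or(inputs), round(ms_atom(inputs), 3))
```

The sigmoidal versions preserve the ordering of the Łukasiewicz values but compress them toward 0.5, which is why the text calls the approximation crude; unlike the piecewise-linear originals, they have a non-zero gradient everywhere.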


Example 5 Let us consider the LRNN

$$\begin{aligned}
\mathcal{N} = \{ & (\mathit{hasBrightEdge} \leftarrow \mathit{isBright}(E),\ 1), \\
 & (\mathit{isBright}(E) \leftarrow \mathit{edge}(E, U, V) \land \mathit{bright}(U) \land \mathit{bright}(V),\ 1), \\
 & (\mathit{bright}(U) \leftarrow \mathit{yellow}(U),\ 2),\ (\mathit{bright}(U) \leftarrow \mathit{red}(U),\ 1),\ (\mathit{bright}(U) \leftarrow \mathit{blue}(U),\ 0.5)\}.
\end{aligned}$$

Let us also have a set $\mathcal{G}$ describing a graph with colored vertices.

$$\begin{aligned}
\mathcal{G} = \{ & (\mathit{edge}(e_1, v_1, v_2),\ 1),\ (\mathit{edge}(e_2, v_2, v_3),\ 1),\ (\mathit{edge}(e_3, v_3, v_4),\ 1),\ (\mathit{edge}(e_4, v_4, v_3),\ 1), \\
 & (\mathit{red}(v_1),\ 1),\ (\mathit{blue}(v_2),\ 1),\ (\mathit{yellow}(v_3),\ 1),\ (\mathit{yellow}(v_4),\ 1)\}
\end{aligned}$$

The output of the atom neuron $A_{hasBrightEdge}$ will only depend on the "brightest edge", i.e.
in this case on the edge $e_3$. The output would be the same for any other colored graph $\mathcal{G}'$
which would also contain an edge connecting two yellow vertices. Thus, for instance, if
we considered some physicochemical property of atoms (e.g. their partial charge) instead of
brightness of colors, and molecules instead of colored graphs, the corresponding networks
could detect the presence of a molecular substructure similar to a prescribed pattern.

Example 6 Let us have the LRNN

$$\begin{aligned}
\mathcal{N} = \{ & (\mathit{highPressure}(X) \leftarrow \mathit{stressed}(X),\ 1),\ (\mathit{highPressure}(X) \leftarrow \mathit{obese}(X),\ 1), \\
 & (\mathit{highPressure}(X) \leftarrow \mathit{exercises}(X),\ -1)\}
\end{aligned}$$

and the set of weighted facts $\mathcal{P} = \{(\mathit{stressed}(alice), 1), (\mathit{obese}(alice), 1), (\mathit{stressed}(bob), 1),
(\mathit{exercises}(bob), 1)\}$. Outputs of aggregation neurons corresponding to rules from $\mathcal{N}$ with the
same predicate in the head are combined using the activation functions $g_\lor$. Intuitively, rules
and facts with the same predicate in the head can be seen as forming a logistic regression
on the values given by the aggregation neurons from the lower layers. When the LRNN has
just one layer, as in this example, one can achieve the same effect using techniques from
propositionalization (Krogel et al., 2003), treating the bodies of the rules as features and
feeding them as attributes to a logistic regression classifier. However, as soon as the LRNN
has more layers, this effect cannot be emulated using propositionalization. In this particular
example, if we construct the ground LRNN of $\mathcal{N} \cup \mathcal{P}$ then the output of the atom neuron
$A_{highPressure(alice)}$ will be higher than the output of the atom neuron $A_{highPressure(bob)}$ (because
alice is stressed and obese whereas bob is just stressed and exercises).
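
The comparison between alice and bob can be checked numerically with a small forward pass using the Max-Sigmoid functions. This is only a sketch under stated assumptions: fact neurons are taken to output the constant 1, the offsets are fixed to $b_0 = 1$ in rule neurons and $b_0 = 0$ in atom neurons, and aggregation neurons of rules with no active grounding are simply omitted. None of these choices is prescribed by the text, and all names are hypothetical.

```python
import math


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


# Rules highPressure(X) <- body(X) as (body predicate, rule weight).
RULES = [("stressed", 1.0), ("obese", 1.0), ("exercises", -1.0)]

# Weighted facts of the example as {(predicate, person): weight}.
FACTS = {
    ("stressed", "alice"): 1.0,
    ("obese", "alice"): 1.0,
    ("stressed", "bob"): 1.0,
    ("exercises", "bob"): 1.0,
}


def fact_atom_output(weight, fact_value=1.0, b0=0.0):
    """Atom neuron with a single fact-neuron input (assumed constant output 1)."""
    return sigm(weight * fact_value + b0)


def high_pressure(person, b0_and=1.0, b0_or=0.0):
    weighted_inputs = []
    for body_pred, rule_weight in RULES:
        if (body_pred, person) not in FACTS:
            continue  # no active ground rule for this person
        body_out = fact_atom_output(FACTS[(body_pred, person)])
        rule_out = sigm(body_out - 1 + b0_and)  # g_and with k = 1
        agg_out = max([rule_out])               # g*_and over the single grounding
        weighted_inputs.append(rule_weight * agg_out)
    return sigm(sum(weighted_inputs) + b0_or)   # g_or, logistic-regression-like


print("alice:", round(high_pressure("alice"), 3))
print("bob:  ", round(high_pressure("bob"), 3))
```

For bob the positive contribution of the stressed rule and the negative contribution of the exercises rule cancel, so the atom neuron sits at the sigmoid midpoint, whereas both of alice's rules push the output up.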

The Max-Sigmoid activation function is obviously not the only one possible. It is useful 
when we are interested in detecting one or more patterns (such as the existence of an edge 
as bright as possible in Example 5) but less useful in situations similar to the one depicted 
in the next example. 

Example 7 Let us consider the following simple LRNN for predicting individuals infected
by flu

$$\mathcal{N} = \{(\mathit{hasFlu}(A) \leftarrow \mathit{friends}(A, B) \land \mathit{hasFluDiagnosed}(B),\ 1)\}$$

and a set of weighted ground facts $\mathcal{P}$ about a group of people and their friendships. If we
constructed the ground neural networks of $\mathcal{N} \cup \mathcal{P}$ using the activation functions from the
Max-Sigmoid family then the prediction of whether an individual has flu would be entirely
based on the existence of at least one person who already had flu diagnosed. It would
be more meaningful to base the predictions on the fraction of one's friends who had
flu diagnosed.

A family of activation functions which is more appropriate in situations similar to
the one described in the above example is given by the next definition.

Definition 3 (Avg-Sigmoid Activation Functions) The Avg-Sigmoid (AS) collection
of activation functions is composed of the following three families of functions:
$g_\land(b_1, \dots, b_k) = sigm\left(\sum_{i=1}^{k} b_i - k + b_0\right)$,
$g^*_\land(b_1, \dots, b_m) = \frac{1}{m} \sum_{i=1}^{m} b_i$, and
$g_\lor(b_1, \dots, b_k) = \sum_{i=1}^{k} b_i + b_0$.

Another advantage of the Avg-Sigmoid family of activation functions over the Max-Sigmoid
family is that the functions from the Avg-Sigmoid family are everywhere
differentiable (which simplifies learning). We note that other activation function families
based on combinations of different aggregation functions might also be exploited for LRNN
learning.
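
The difference between max and average aggregation in the flu example can be seen in a few lines of Python. The sketch makes simplifying assumptions that are not in the text: every friend B has a ground rule, the outputs of the friends and hasFluDiagnosed atom neurons are fed in directly as 1.0 or 0.0, and the rule-neuron offset is fixed to $b_0 = 1$. All names are hypothetical.

```python
import math


def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))


def rule_neuron(body_outputs, b0=1.0):
    """g_and, identical in the Max-Sigmoid and Avg-Sigmoid collections."""
    return sigm(sum(body_outputs) - len(body_outputs) + b0)


def aggregate_max(values):
    """g*_and of the Max-Sigmoid collection."""
    return max(values)


def aggregate_avg(values):
    """g*_and of the Avg-Sigmoid collection."""
    return sum(values) / len(values)


def has_flu_aggregation(diagnosed, aggregate):
    """Aggregate the ground rules hasFlu(a) <- friends(a, B) and hasFluDiagnosed(B).

    `diagnosed` holds one assumed atom output per friend B (1.0 or 0.0);
    the friends(a, B) atom output is assumed to be 1.0.
    """
    rule_outputs = [rule_neuron([1.0, d]) for d in diagnosed]
    return aggregate(rule_outputs)


one_of_ten = [1.0] + [0.0] * 9
nine_of_ten = [1.0] * 9 + [0.0]

print("max:", has_flu_aggregation(one_of_ten, aggregate_max),
      has_flu_aggregation(nine_of_ten, aggregate_max))
print("avg:", has_flu_aggregation(one_of_ten, aggregate_avg),
      has_flu_aggregation(nine_of_ten, aggregate_avg))
```

With max aggregation the two individuals are indistinguishable, since a single diagnosed friend already determines the value; with average aggregation the value grows with the fraction of diagnosed friends.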

4. Some LRNN Modeling Constructs 

In this section we describe several constructs which are easy in LRNNs but which would be
difficult or impossible to implement in other existing frameworks combining logic and neural
networks solely because, unlike LRNNs, the other frameworks do not allow simultaneous
learning of target and auxiliary predicates. Moreover, while somewhat similar constructs
could in principle be used in probabilistic logic programming systems such as ProbLog (De
Raedt et al., 2007), when learning, they would require running costly EM algorithms which
repeatedly need to perform computationally expensive probabilistic inference.

4.1. Implicit Soft Clustering

In many domains one needs to create clusters of certain objects in order to achieve good 
generalization. This is the case e.g. in prediction of adverse effects of drugs where significant 
improvements in predictive accuracy were gained by methods which were able to create 
auxiliary clusters of similar drugs (Davis et al., 2012). However, the existing methods are 
still rather ad-hoc, relying on greedy discrete clustering. In LRNNs it is easy to define 
predicates representing these clusters, to train their weights automatically and use them for 
prediction of target predicates as illustrated by the following example. 

Example 8 Let us suppose that, similarly to (Davis et al., 2012), we have temporal data
about patients, drugs which the patients took and time instants when changes in health
occurred. Let us also assume that we have a set of general rules like:

$$\begin{aligned}
w_1^{(1)} &: \mathit{effect}(P, AE, T2) \leftarrow \mathit{took}(P, D1, T1) \land \mathit{period}(T1, T2, T) \land \mathit{shortPeriod}(T) \\
 &\qquad \land\ \mathit{took}(P, D2, T2) \land \mathit{drugGroup1}(D1) \land \mathit{drugGroup2}(D2) \\
 &\qquad \land\ \mathit{effectGroup1}(AE) \\
w_1^{(2)} &: \mathit{effectGroup1}(E) \leftarrow \mathit{headache}(E) \\
w_2^{(2)} &: \mathit{effectGroup1}(E) \leftarrow \mathit{sneezing}(E)
\end{aligned}$$

Using the Max-Sigmoid family of aggregation functions, weight learning in this LRNN can
implicitly create clusters of drugs which interact adversely with other clusters of drugs and
clusters of adverse effects corresponding to these combinations of drugs, as well as an
appropriate definition for the predicate shortPeriod.

While we were not able to perform experiments in the domain described in the above
example because the data are not available for privacy reasons, we perform a simpler set of
experiments in organic chemistry domains where the implicitly created soft clusters
correspond to groups of atom types and atomic bond types. We describe these experiments in
detail in Section 7. There we show that useful clusters are indeed created automatically
by weight learning in LRNNs. One of the reasons for discussing the example about
adverse effects of drugs here (in spite of the unavailability of the data) is to indicate that the
machinery of LRNNs is very promising for existing problems for which only rather ad-hoc
solutions exist currently.

4.2. Soft Matching 

The next example explains the notion of a construct called soft matching and how it can 
be modeled in LRNNs. 

Example 9 Let us again consider the example about predicting flu. Let us suppose that we
have the reasonable rule that if X is in a group of 4 people who are mutual friends and all
of them have flu symptoms then X has flu

$$\begin{aligned}
w_1^{(1)} : \mathit{hasFlu}(X) \leftarrow\ & \mathit{clique}(W, X, Y, Z) \land \mathit{fluSymptoms}(W) \land \mathit{fluSymptoms}(X) \\
 & \land\ \mathit{fluSymptoms}(Y) \land \mathit{fluSymptoms}(Z).
\end{aligned}$$

However, it is probably not necessary for W, X, Y and Z to be mutually friends in order
for this rule to make sense. The rule is still valid, but maybe with lower certainty, if two
of these four people are not actually friends, or maybe even if there are two such pairs
or more. This is easily expressible in LRNNs by suitably defining the predicate clique and
automatically learning the respective weights:

$$\begin{aligned}
w_1^{(2)} &: \mathit{clique}(W, X, Y, Z) \leftarrow f(W, X) \land f(W, Y) \land f(W, Z) \land f(X, Y) \land f(X, Z) \land f(Y, Z) \\
w_1^{(3)} &: f(X, Y) \leftarrow \mathit{friends}(X, Y) \land \mathit{friends}(Y, X) \\
w_2^{(3)} &: f(X, Y) \leftarrow \mathit{friends}(X, Y) \\
w_3^{(3)} &: f(X, Y).
\end{aligned}$$

Here, the predicate friends is assumed to be part of the description of examples and soft matching
of cliques is facilitated by the definition of the predicate f based on it. Using the activation
functions from the Max-Sigmoid family for the predicates hasFlu and f, we can obtain the
desired behavior with suitable weights.

4.3. Other LRNN Concepts 

While soft clustering and soft matching are probably the modeling concepts which would
be used most often in practice, there are other modeling concepts which are easily
implementable with LRNNs. One such other concept is low dimensional approximation of sets
of (hyper)graph patterns which share structure but not labels, as exemplified below.

Example 10 Let us consider the problem of predicting a property, e.g. toxicity, of organic
molecules which depends on the presence of substructures from a certain rather large set. If the
patterns have the same structure, e.g. they are all aromatic six-rings with substitutions (see
footnote 3) at some positions, one could in principle use probabilistic modeling to approximate
this set by a probability distribution on the substitutions at different places so that the
substitutions which are jointly occurring in the set of patterns would have high probability and
the other substitutions small probability. While this probabilistic modeling approach is possible, it
requires us to explicitly have the set of patterns. If the set of patterns should correspond to a
latent concept, we would have to resort to EM. On the other hand, similar approximations
to the latent set of patterns can be modeled in LRNNs quite easily. For instance, if we want
to capture pair-wise dependencies of substitutions in neighboring atoms, we can first define
auxiliary binary predicates

$$w_1^{(1)} : e_1(carbon, nitrogen), \quad w_2^{(1)} : e_1(carbon, oxygen), \quad \dots$$

Then, we can define a predicate

$$w_1^{(2)} : \mathit{sixRing}(A, B, C, D, E, F) \leftarrow \mathit{ring}(A, B, C, D, E, F) \land e_1(A, B) \land e_2(B, C) \land \dots \land e_6(F, A)$$

and similar predicates for five-rings and other structures, and then construct rules for
prediction of the property of interest (toxicity in this case) as follows:

$$\begin{aligned}
w_1 &: \mathit{toxic}(M) \leftarrow \mathit{atom}(M, A) \land \mathit{atom}(M, B) \land \dots \land \mathit{atom}(M, F) \land \mathit{sixRing}(A, B, C, D, E, F) \\
w_2 &: \mathit{toxic}(M) \leftarrow \mathit{atom}(M, A) \land \mathit{atom}(M, B) \land \dots \land \mathit{atom}(M, E) \land \mathit{fiveRing}(A, B, C, D, E)
\end{aligned}$$

Weight learning can then simultaneously adjust weights of the latent auxiliary predicates as
well as the target predicates (we show this experimentally in Section 7).

3. The basic aromatic six-ring is the benzene ring which is a ring of six carbon atoms, each connected to a
hydrogen atom, connected by aromatic bonds. If one of the carbon atoms is replaced by another atom,
we speak of a substitution.

Exploiting the process of grounding of the lifted template, facilitating weight sharing in 
the ground networks, LRNNs can also emulate principal structures of convolutional neural 
networks (Krizhevsky et al., 2012) as the next example shows. 


Example 11 Let us consider a structure of the popular Convolutional Neural Network ar¬ 
chitecture composed of sparse convolutional layers alternated with max-pooling. Within the 
sparse layer, the weights corresponding to a single convolution filter are effectively bound to 
the same value while the filter is repeated across. Within selected subregions, the resulting 
feature-map values are then aggregated with application of max-pooling, i.e. only the maxi¬ 
mal values from each feature-map region are propagated further. This structural idea can be 
efficiently encoded by LRNN and generalized for feature maps (images) of varying size with 
the choice of Max-Sigmoid function family and a simple lifted template defined as follows 


« 4 1} 

■fl 

( 2 ) 

w\ 

: left(X) 

( 2 ) 

w\ 

: mid(X) 

( 2 ) 

w\ 

: right (X) 


left(A),mid(B),right(C), next(A, B),next(B, C ) 

fo(X) 

fo(X) 

MX) 


which corresponds to a convolution filter $f_1$ that can be bound to an arbitrary number of
relational patterns, in this case simple linear segments of three neighboring features $(A, B, C)$
of the input feature-map, defined as a linearly ordered set of weighted facts about feature
$f_0$ values $(f_0(X), v_X)$ (i.e. values $v_X$ of pixels $X \in \{1, \dots, n\}$). The choice of the
Max-Sigmoid family then ensures that max-aggregation is applied on top of each such
convolutional layer. A visualization of a grounding of this template on a particular feature-map
(image $I$) of five ($n = 5$) consecutive values (pixels) is provided in Fig. 5.
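The grounding described in Example 11 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `ground_conv_template`, the zero offsets, the plain sigmoid applied in every neuron, and the 0-based pixel indexing are all assumptions made for illustration. It only shows how one shared triple of weights is reused in every grounding of the rule and how the groundings are then max-aggregated.

```python
import math


def sigm(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def ground_conv_template(pixels, w_left, w_mid, w_right):
    """Ground the three-position template on a 1-D feature map and max-pool.

    Assumptions (not taken from the paper): offsets are zero and every
    neuron applies a plain sigmoid to the sum of its weighted inputs.
    Pixels are indexed from 0 here, whereas the text indexes them from 1.
    """
    n = len(pixels)
    if n < 3:
        raise ValueError("the template needs at least three consecutive pixels")
    # next(A, B) and next(B, C) restrict groundings to consecutive triples.
    groundings = [(a, a + 1, a + 2) for a in range(n - 2)]
    rule_outputs = []
    for a, b, c in groundings:
        # The same three weights are reused in every grounding (weight sharing).
        left = sigm(w_left * pixels[a])
        mid = sigm(w_mid * pixels[b])
        right = sigm(w_right * pixels[c])
        rule_outputs.append(sigm(left + mid + right))
    # Max-Sigmoid family: groundings of one rule are aggregated with max.
    return groundings, rule_outputs, max(rule_outputs)


groundings, rule_outputs, f1 = ground_conv_template(
    [0.1, 0.9, 0.4, 0.7, 0.2], w_left=1.0, w_mid=-0.5, w_right=2.0
)
```

For a feature map of five values the template yields three groundings, and the same function works unchanged for inputs of any length of at least three, which is the size-generalization the example refers to.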


Other concepts which we do not describe in detail due to lack of space include e.g. 
relational auto-encoders. 


5. Weight Learning 

Let us have a LRNN $\mathcal{N}$ and a set of training examples $\mathcal{E} = \{\mathcal{E}_1, \dots, \mathcal{E}_m\}$ where each $\mathcal{E}_j$
is some structure represented by a set of weighted propositions (e.g. left part of Fig. 2),
i.e. a LRNN containing only facts$^4$. Let us also have a set

$$Q = \{\{(q_1^1, t_1^1), \dots, (q_1^{k_1}, t_1^{k_1})\}, \dots, \{(q_m^1, t_m^1), \dots, (q_m^{k_m}, t_m^{k_m})\}\}$$

where $q_j^i$ are ground atoms, which we call training query atoms, and $t_j^i$ are their target
values. For any query atom $q_j^i$ let $y_j^i$ denote the output of the atom neuron $A_{q_j^i}$ in the
ground neural network of $\mathcal{N} \cup \mathcal{E}_j$. The goal of the learning process is to find weights $w_h$
of the rules (and possibly facts) in $\mathcal{N}$ minimizing cost $J$ on the training query atoms

$$J(Q) = \sum_{j=1}^{m} \sum_{i=1}^{k_j} cost(y_j^i, t_j^i)$$

where $cost$ is some predefined cost function which measures the discrepancy between the
output of the atom neurons of the training query atoms and their desired target values.
Similarly to conventional NNs, weight adaptation is performed by gradient descent steps

$$w_h \leftarrow w_h - \gamma \frac{\partial J(Q)}{\partial w_h}$$

where $\gamma$ is some given learning rate. The main difference is that in the case of LRNNs, the
ground neural networks may be very different for different learning examples $\mathcal{E}_j$. However,
this is not a fundamental problem because the weights for all the ground neural networks
$\mathcal{N} \cup \mathcal{E}_j$ are fully specified in the LRNN $\mathcal{N}$.

4. The restriction of learning from facts only is actually not necessary but it will simplify this presentation.

[Figure 5 input panels: image $I$ as a vector of pixel values; image $I$ as a set of weighted facts
$(f_0(1), v_1), \dots, (f_0(5), v_5)$ together with the facts $next(1, 2)$, $next(2, 3)$, $next(3, 4)$, $next(4, 5)$.]

Figure 5: A demonstration of a part of standard Convolutional Neural Network structure
with sparse, convolutional layer composed of application of a filter $f_1$ creating a
feature-map layer followed by max-pooling (left). The same calculation structure
is presented by a ground LRNN (right) efficiently encoded with a template from
Example 11, generalizing over feature-vectors (images) of unrestricted size.

Example 12 Let us demonstrate for clarity a sample scenario with the Avg-Sigmoid activation
function family and a mean square error cost function, i.e. with each step we aim to
decrease$^5$

$$J(Q) = \frac{1}{2} \sum_{j=1}^{m} \sum_{i=1}^{k_j} \left(\mathrm{sigm}(t_j^i) - \mathrm{sigm}(y_j^i)\right)^2$$

5. In this example, we pass the output from the output atom neurons and the target values through a
sigmoid. This is useful when learning with the Avg-Sigmoid activation function family. An alternative
would be to use cross-entropy as error function.

where the target values $t_j^i$ are given by $Q$ and outputs $y_j^i$ of individual atom neurons $A_{q_j^i}$ are
calculated as



$$y_j^i = \sum_{k} w_k \cdot Agg^{(h \leftarrow b_1 \wedge \dots \wedge b_{n_k},\, w_k)}_{q_j^i} - w_{A_{q_j^i}}$$

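where $Agg^{(h \leftarrow b_1 \wedge \dots \wedge b_{n_k},\, w_k)}_{q_j^i}$ denotes the outputs of aggregation neurons forming the inputs of
$A_{q_j^i}$ with respective rule weights $w_k$, and $w_{A_{q_j^i}}$ denotes the offset of the activation function of
the atom neuron $A_{q_j^i}$. Since we have chosen the Avg-Sigmoid function family, the outputs
of aggregation neurons are further calculated as

$$Agg^{(h \leftarrow b_1 \wedge \dots \wedge b_{n_k},\, w_k)}_{q_j^i} = \frac{1}{M} \sum_{m=1}^{M} R_{q_j^i \leftarrow b_1\theta_m \wedge \dots \wedge b_{n_k}\theta_m}$$

where $R_{q_j^i \leftarrow b_1\theta_m \wedge \dots \wedge b_{n_k}\theta_m}$ denotes outputs of respective input rule neurons formed from all
$M$ different groundings (substitutions $\theta_m$) of the rule $h\theta_m \leftarrow b_1\theta_m \wedge \dots \wedge b_{n_k}\theta_m$ where $h\theta_m = q_j^i$.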
The output of the rule neurons can finally be calculated as

$$R_{q_j^i \leftarrow b_1\theta_m \wedge \dots \wedge b_{n_k}\theta_m} = \mathrm{sigm}(\dots)$$

where $A_{b_o\theta_m}$ denotes the output of another (regular) atom neuron from the lower layers of the
ground network $\mathcal{N} \cup \mathcal{E}_j$ corresponding to one of the ground body literals $b_o\theta_m$ of the respective
ground rule $q_j^i \leftarrow b_1\theta_m \wedge \dots \wedge b_{n_k}\theta_m$. The calculation of $A_{b_o\theta_m}$ can further be carried out
in a recursive manner until the fact neurons are reached with fixed constant values
defined by $\mathcal{E}$ (or possibly $\mathcal{N}$). We note that the whole evaluation is composed of differentiable
functions and the gradient can thus be calculated using the regular chain rule.
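The recursive evaluation in Example 12 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the classes `Fact` and `Atom`, the function `evaluate`, and the dictionary of named weights are hypothetical, the rule neuron is assumed to apply a sigmoid to the plain sum of its body atom outputs, and the factor 1/2 in `cost` is likewise an assumption.

```python
import math
from dataclasses import dataclass


def sigm(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class Fact:
    value: float  # fixed constant given by the example


@dataclass
class Atom:
    # rules: list of (weight_name, groundings); each grounding lists its body neurons
    rules: list
    offset: float = 0.0


def evaluate(node, weights):
    """Recursively evaluate a ground network down to its fact neurons."""
    if isinstance(node, Fact):
        return node.value
    total = 0.0
    for weight_name, groundings in node.rules:
        # One rule neuron per grounding (assumed form: sigmoid of summed body outputs).
        rule_outputs = [
            sigm(sum(evaluate(body_atom, weights) for body_atom in body))
            for body in groundings
        ]
        # Avg-Sigmoid family: average over all groundings of the same rule.
        aggregated = sum(rule_outputs) / len(rule_outputs)
        total += weights[weight_name] * aggregated
    # Subtract the offset of the atom neuron.
    return total - node.offset


def cost(y: float, t: float) -> float:
    """Squared error after passing output and target through a sigmoid."""
    return 0.5 * (sigm(t) - sigm(y)) ** 2


# A query atom defined by one rule with two groundings over single facts.
query = Atom(rules=[("w1", [[Fact(1.0)], [Fact(0.0)]])], offset=0.1)
y = evaluate(query, {"w1": 2.0})
```

Because weights are looked up by name, the same entry of `weights` can feed several neurons of the ground network, which is how a template weight ends up shared.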



Moreover, the weights from $\mathcal{N}$ can be repeated multiple times within a single $\mathcal{N} \cup \mathcal{E}_j$,
but since recursion is not allowed, the same weight can appear at most once on any simple
path from a fact neuron to an atom neuron. Therefore it is possible to learn the weights
using the conventional online stochastic gradient descent algorithm$^6$, except that the increments
for the shared weights must be accumulated, which is a simple consequence of linearity of
partial differentiation. The same principle is exploited e.g. in learning of convolutional
neural networks (Example 11).

6. Learning is slightly more complicated for LRNNs with the Max-Sigmoid family of activation functions
because the max operator introduces non-differentiable points to the optimization problem.

Remark 4 Let us consider a ground $\mathcal{N} \cup \mathcal{E}_j$ as a regular feed forward neural network $N_j$
with some weights $w_k \in W_j$ in the network being shared, i.e. bound to the same value, with
the restriction that each particular weight $w_k$ appears at most once on any simple path from
input $e_j$ to output $y_j$. Let the activation functions of layers $l$ of $N_j$ be $f^l \in M$ from some
set of differentiable functions. Let further $w_k^l$ denote particular occurrences of some shared
weight $w_k$, then we might express the output of the network as

$$y_j = f^1(\dots + w_k^1 f^2(\dots) + \dots + w_m f^2(\dots + w_k^2 f^3(\dots) + \dots) + \dots)$$

where "$\dots$" correspond to expressions with no $w_k$ occurrence. Considering each $w_k$
occurrence separately as an independent variable, we have

$$\frac{\partial y_j}{\partial w_k^1} = f^{1\prime}(\dots) f^2(\dots), \qquad \frac{\partial y_j}{\partial w_k^2} = f^{1\prime}(\dots)\, w_m f^{2\prime}(\dots) f^3(\dots)$$

Considering all occurrences of $w_k^l$ as a single variable $w_k$, we have

$$\frac{\partial y_j}{\partial w_k} = f^{1\prime}(\dots) \left(f^2(\dots) + w_m f^{2\prime}(\dots) f^3(\dots)\right)$$

i.e., we see that $\frac{\partial y_j}{\partial w_k} = \sum_l \frac{\partial y_j}{\partial w_k^l}$, which follows also directly from additivity of the
differentiation operator (keeping in mind that there is only one occurrence of $w_k$ on any simple path
from an atom neuron to a fact neuron). Therefore the gradient can be computed for the ground
neural networks created from a given LRNN in the standard way and then the components
corresponding to a particular weight $w_k$ can be accumulated.
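The accumulation rule of Remark 4 can be checked numerically on a toy network. The following Python sketch is only an illustration: it fixes all activation functions $f^1, f^2, f^3$ to the sigmoid, drops the terms written as "$\dots$" in the remark, and uses hypothetical function names. It computes the two per-occurrence partial derivatives analytically, sums them, and compares the result with a central finite difference taken with respect to the single shared weight.

```python
import math


def sigm(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def dsigm(x: float) -> float:
    """Derivative of the logistic sigmoid."""
    s = sigm(x)
    return s * (1.0 - s)


def output(wk1, wk2, wm, x1, x2):
    """y = f1(wk1 * f2(x1) + wm * f2(wk2 * f3(x2))) with all f = sigmoid."""
    return sigm(wk1 * sigm(x1) + wm * sigm(wk2 * sigm(x2)))


def occurrence_gradients(wk, wm, x1, x2):
    """Partial derivatives w.r.t. each occurrence, treated as independent."""
    inner = wk * sigm(x2)
    outer = wk * sigm(x1) + wm * sigm(inner)
    d_first = dsigm(outer) * sigm(x1)                       # f1'(.) f2(.)
    d_second = dsigm(outer) * wm * dsigm(inner) * sigm(x2)  # f1'(.) w_m f2'(.) f3(.)
    return d_first, d_second


def shared_gradient(wk, wm, x1, x2):
    """Gradient w.r.t. the shared weight: the sum over its occurrences."""
    return sum(occurrence_gradients(wk, wm, x1, x2))


def finite_difference(wk, wm, x1, x2, h=1e-6):
    """Central difference, moving both occurrences of the shared weight together."""
    plus = output(wk + h, wk + h, wm, x1, x2)
    minus = output(wk - h, wk - h, wm, x1, x2)
    return (plus - minus) / (2.0 * h)
```

Agreement between `shared_gradient` and `finite_difference` for arbitrary inputs is what the remark asserts: backpropagation can treat the occurrences independently and add up their contributions afterwards.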


Specifically, our weight-learning algorithm works as follows. First, it grounds the given
LRNN $\mathcal{N}$ w.r.t. every example $\mathcal{E}_j$ from the dataset, which gives it a set of ground neural
networks $\mathcal{N} \cup \mathcal{E}_j$ with shared weights (it keeps the information about the origin of each
weight so that it can update the respective weights in the template in each step of the
iteration). It then iterates over the ground networks in a random order, computes the gradient
of the error function for the current example given the current weights in the
template, updates the weights accordingly and continues iterating these steps (i.e., the
standard stochastic gradient descent procedure). In order to reduce the risk of getting
stuck in poor-quality local optima, we also employ a restart strategy for this algorithm.
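The loop described above can be sketched as follows. This is a hedged illustration rather than the authors' procedure: the function `sgd_with_restarts`, its parameters, and the toy loss are hypothetical, and the restart strategy is assumed here to mean random re-initialization of the template weights while keeping the run with the lowest training loss, which the text does not specify. The `grad` callback is assumed to return, for one ground network, the gradient already accumulated per template weight name.

```python
import random


def sgd_with_restarts(examples, loss, grad, weight_names,
                      lr=0.1, epochs=50, restarts=3, seed=0):
    """Online SGD over ground networks, repeated from several random starts."""
    rng = random.Random(seed)
    best_weights, best_loss = None, float("inf")
    for _ in range(restarts):
        # Assumed restart strategy: fresh random template weights.
        weights = {name: rng.uniform(-1.0, 1.0) for name in weight_names}
        for _ in range(epochs):
            order = list(examples)
            rng.shuffle(order)  # visit the ground networks in random order
            for example in order:
                # grad returns {weight_name: accumulated gradient} for this example
                for name, value in grad(weights, example).items():
                    weights[name] -= lr * value
        total = sum(loss(weights, example) for example in examples)
        if total < best_loss:
            best_weights, best_loss = dict(weights), total
    return best_weights, best_loss


# Toy stand-in for a ground network: a single shared weight fitting t = w * x.
def toy_loss(weights, example):
    x, t = example
    return 0.5 * (weights["w"] * x - t) ** 2


def toy_grad(weights, example):
    x, t = example
    return {"w": (weights["w"] * x - t) * x}


best_weights, best_loss = sgd_with_restarts(
    [(1.0, 2.0), (2.0, 4.0)], toy_loss, toy_grad, ["w"]
)
```

Keeping the weights in a dictionary keyed by template weight name mirrors the bookkeeping mentioned in the text, where each weight of a ground network remembers which template weight it originated from.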


6. Related Work 

The main inspiration for the work presented in this paper are lifted graphical models such 
as Markov logic networks (Richardson and Domingos, 2006) or Bayesian logic programs 
(Kersting and De Raedt, 2001). However, none of these existing lifted graphical models is 
particularly well suited for learning parameters of latent relational structures. Our approach 
is also generally related to prior art in combining logical rules with neural networks, also
known as neural-symbolic integration (d’Avila Garcez et al., 2012), such as in the KBANN
system. While KBANN (Towell et al., 1990) also constructs the network structure
from given rules, these rules are propositional rather than relational and do not serve as
a lifted template. Therefore it cannot learn relational latent structures such as
soft clustering of first-order-logic constants. A more recent system, CILP++ (França et al.,
2014), utilizes a relational representation, which is however converted into a propositional
form through a propositionalization technique (Krogel et al., 2003). This again means that
latent relational structures such as those exemplified in Section 4 cannot be learned by
CILP++ either. A somewhat more closely related paper on FONN (Botta et al., 1997)
also designs a technique forming a network from a relational rule set; however, this rule set is
flat, producing only 1-layer (shallow) networks in which relational patterns are not
hierarchically aggregated. While there are many other approaches to neural-symbolic integration
aiming at relational (and first-order) representations (Bader and Hitzler, 2005), e.g. based
on the CORE method (Hölldobler et al., 1999), they typically search for a uniform model
of the logic program in scope and thus principally differ from the presented lifted modeling
approach.

While standard feed-forward neural networks can be seen as a special case of LRNNs,
since any such fixed neural architecture can be encoded in a corresponding ground rule
set with respective activation functions, a salient aspect of our method is that it allows for
learning from structured (relational) examples, rather than just attribute vectors. There
has been previous work on adapting neural networks to cope with certain facets of
relational representations. For example, an extension to multi-instance learning was presented
in (Ramon and De Raedt, 2000). A similarly directed work (Blockeel and Uwents, 2004)
facilitated aggregative reasoning to process sets of related tuples from a relational database as
a sequence through a recurrent neural network structure, which was also presented for more
general structures in (Scarselli et al., 2009). These approaches are principally different from
the presented method as they do not follow the lifted modeling strategy to cope with
variations in structure of relational samples. More loosely related works arise also in the neural
networks community, where various recursive auto-encoders based on the idea of “reduced
descriptions” (Hinton, 1990) are trained to encode structured data. Another line of work
comprises convolutional neural networks (LeCun et al., 1998) and techniques of indirect
encoding (Clune et al., 2011), exploiting patterns and regularities in neural connections to create
more compressed representations of large neural networks. However, these approaches are
still geared towards learning from fixed-length propositional rather than relational data.

7. Experiments 

In this section we describe experiments performed on 78 datasets of organic molecules:
the Mutagenesis dataset (Lodhi and Muggleton, 2005), four datasets from the predictive
toxicology challenge and 73 NCI-GI datasets (Ralaivola et al., 2005). The Mutagenesis dataset
contains 188 molecules with labels denoting their mutagenicity. A number of the results
published on the Mutagenesis dataset use an extended set of features, providing additional
expert knowledge on relational properties of molecules, which diminishes the role of learning
capabilities in relational models. We do not use any of the extra features as we utilize only
atom-bond information. The predictive toxicology challenge dataset (PTC) (Helma et al.,
2001) is composed of four datasets of molecules labeled by their toxicity for female rats
(fr), female mice (fm), male rats (mr) and male mice (mm). Each of the NCI-GI datasets
contains several thousands of molecules labeled by their ability to inhibit growth of different
types of tumors. We compare performance of LRNNs to state-of-the-art relational learners
kFOIL (Landwehr et al., 2006) and nFOIL (Landwehr et al., 2007), where kFOIL combines
relational rule learning with support vector machines and nFOIL combines relational rule
learning with naive Bayes learning.

For LRNNs we use a simple hand-crafted template which is based on the idea of implicit
soft clustering described in Section 4.1 and is principally identical to the template discussed
in Figure 2. The template defines 3 predicates for clusters of atom types and 3 predicates for
clusters of bond types. The three predicates representing atom-type clusters are composed
of exhaustive lists of atom types occurring in the datasets, e.g. $w_1 : atgr1(X) \leftarrow o(X)$,
$w_2 : atgr1(X) \leftarrow br(X)$, ... and similarly the predicates representing bond-type clusters
are composed of exhaustive lists of bond types occurring in the datasets. These predicates
are then used in definitions of predicates for different types of small chains of atoms of
length 3, e.g. $chain1 \leftarrow atgr1(X) \wedge bond(X, Y, B_1) \wedge atgr1(Y) \wedge bond(Y, Z, B_2) \wedge atgr2(Z)
\wedge bondgr1(B_1) \wedge bondgr2(B_2)$. These are finally used to define the target predicate, e.g.
$toxic$. Using such a generic template for all the datasets, we make sure that there is no
additional expert knowledge involved$^7$. The idea is that in the process of learning, useful
latent relational concepts are created within the neural network by means of weight
adaptation rather than by explicit enumeration, in contrast to propositional approaches
and ILP (De Raedt, 2008). Indeed, none of the rules used in this template is useful on its own
for prediction as a hard logic rule without weight adaptation.

To set the parameters of LRNNs we use the empirical risk minimization principle on the
training cross-validation folds to select the parameters such as step size, restarts, number
of iterations, etc. This way we obtain unbiased estimates of performance of our method
since test data is never involved in parameter selection. The time for training a LRNN was
in the order of a few hours for the larger NCI-GI datasets. The results of the experiments
are summarized in Figure 6. LRNNs perform clearly the best of the algorithms in terms of
accuracy as they have lower prediction error than kFOIL and nFOIL on a significant majority
of datasets. We also tried to compare LRNNs with another recent algorithm combining logic
and neural networks, called CILP++ (França et al., 2014), but we did not find it to perform
well on our relational datasets as we were not able to obtain, using CILP++, accuracy
significantly higher than that of the simple majority-class predictor on any of the datasets$^8$.

For demonstration, we provide visualization of the latent grouping (clustering) LRNN
layers for the Mutagenesis and for the PTC-mr datasets in Fig. 7. It is apparent from the
learned weights in the figures that the hidden layers are indeed learning useful latent groupings
of atom types. It is interesting to note that on the Mutagenesis dataset, one of the learned
groupings of atom types gives all atoms almost the same weight, which actually makes sense
because it corresponds to a “wild-card” atom type. On the other hand, no similar behavior
is typically found for the other datasets, which we have checked for different seeds of the
PRNG used for initialization of weights.

7. I.e., the template does not relate to any specific property of molecules and might as well be used for
other classification tasks.

8. While relatively reasonable results for Mutagenesis were reported in (França et al., 2014), the
expert-knowledge attributes were used in the experiments reported therein, which might explain the discrepancy
between the results.

Figure 6: Prediction errors of LRNNs, kFOIL and nFOIL measured by cross-validation on
78 datasets of organic molecules.



Figure 7: Visualization of latent concepts demonstrated through LRNN’s weights of rules
defining particular groups of atoms when learned on the Mutagenesis
dataset (left) and on the PTC-mr dataset (right). Lighter colors denote lower and darker
colors higher weights, respectively.


In order to test the modeling concept described in Section 4.3, we performed an additional
experiment with the Mutagenesis dataset. We used almost exactly the same template
as in Example 10 but instead of ring structures we used chains of varying lengths (up to 5
atoms). We trained the resulting LRNN to optimize the template’s weights, however here
we were more interested in extracting the learned patterns. We determined the chains of
atoms which gave the highest output for the learned latent predicates. We obtained the
following atom chain structures: C-C-F, N-O, C-Cl, C-Br, C-C-O, O-N-C. At least some of
these structures appear to be directly relevant for mutagenicity as they are organic
structures containing halogen atoms (Br, F and Cl). The other structures may be relevant
to mutagenicity in combination with other structures.

8. Conclusions 

In this paper, we have introduced a method combining relational-logic representations with 
feedforward neural networks. The introduced method is close in spirit to lifted graphical 
models as it can be viewed as providing a lifted model for construction of ground neural 
networks. The performed experiments indicate that it is possible to achieve state-of-the-art 
predictive accuracies by weight learning with very generic templates and that it is able to 
induce notable auxiliary concepts. There are many directions for future work, including 
structure learning, transfer learning or studying different collections of activation functions. 
An important future direction is also the question of extending LRNNs to support recursion. 